Novel Corona Virus 2020 Dataset Analysis & Prediction

From the World Health Organization: on 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province, China. Since the beginning of the coronavirus pandemic, WHO and the Our World in Data team have been collecting the number of COVID-19 cases and deaths on a daily basis, based on reports from health authorities worldwide. To ensure the accuracy and reliability of the data, this process is constantly being refined. The data helps to monitor and interpret the dynamics of the COVID-19 pandemic not only in the European Union (EU) and the European Economic Area (EEA), but also worldwide.

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

For this project we use the dataset provided by https://ourworldindata.org/coronavirus-source-data . We will visualize the current trend of cases in Asia, and especially in Nepal. So, let's begin the project by importing the dataset.

In [2]:
covid_data = pd.read_csv('owid-covid-data.csv',sep=',')
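
By default `read_csv` loads the `date` column as plain strings. A minimal sketch of parsing it into real dates at load time is shown here; the inline CSV sample is made up for illustration and only stands in for `owid-covid-data.csv`.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for 'owid-covid-data.csv'
sample_csv = io.StringIO(
    "iso_code,continent,location,date,total_cases,new_cases\n"
    "NPL,Asia,Nepal,2020-10-12,100.0,10.0\n"
    "NPL,Asia,Nepal,2020-10-13,115.0,15.0\n"
)

# parse_dates converts the 'date' column to a datetime dtype while reading
sample = pd.read_csv(sample_csv, parse_dates=["date"])

print(sample.dtypes)
# With real dates, differences between rows are time spans rather than text
print(sample["date"].max() - sample["date"].min())
```

With a datetime column, sorting and date arithmetic no longer depend on the text format of the dates.
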
In [3]:
covid_data.head()
Out[3]:
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index
0 ABW North America Aruba 2020-03-13 2.0 2.0 NaN 0.0 0.0 NaN ... 35973.781 NaN NaN 11.62 NaN NaN NaN NaN 76.29 NaN
1 ABW North America Aruba 2020-03-19 NaN NaN 0.286 NaN NaN 0.0 ... 35973.781 NaN NaN 11.62 NaN NaN NaN NaN 76.29 NaN
2 ABW North America Aruba 2020-03-20 4.0 2.0 0.286 0.0 0.0 0.0 ... 35973.781 NaN NaN 11.62 NaN NaN NaN NaN 76.29 NaN
3 ABW North America Aruba 2020-03-21 NaN NaN 0.286 NaN NaN 0.0 ... 35973.781 NaN NaN 11.62 NaN NaN NaN NaN 76.29 NaN
4 ABW North America Aruba 2020-03-22 NaN NaN 0.286 NaN NaN 0.0 ... 35973.781 NaN NaN 11.62 NaN NaN NaN NaN 76.29 NaN

5 rows × 41 columns

Let's see how the coronavirus spread over time

In [4]:
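# Keep only the rows that report at least one new case
covid_data_countrydate = covid_data[covid_data['new_cases']>0]
# Sum only new_cases; summing every column would also try to add up text columns such as iso_code
covid_data_countrydate = covid_data_countrydate.groupby(['date','location'])['new_cases'].sum().reset_index()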

fig = px.choropleth(covid_data_countrydate, 
                    locations="location", 
                    locationmode = "country names",
                    color="new_cases", 
                    hover_name="location", 
                    animation_frame="date"
                   )

fig.update_layout(
    title_text = 'Spread of Coronavirus',
    title_x = 0.5,
    geo=dict(
        showframe = False,
        showcoastlines = False,
    ))
    
fig.show()

SELECTING ASIA REGION ONLY

Here we visualize only the data from Asia, so we select the rows whose continent is Asia.

In [5]:
covid_data1=covid_data.loc[covid_data['continent'] == 'Asia']
covid_data1
Out[5]:
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index
210 AFG Asia Afghanistan 2019-12-31 0.0 0.0 NaN 0.0 0.0 NaN ... 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83 0.498
211 AFG Asia Afghanistan 2020-01-01 0.0 0.0 NaN 0.0 0.0 NaN ... 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83 0.498
212 AFG Asia Afghanistan 2020-01-02 0.0 0.0 NaN 0.0 0.0 NaN ... 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83 0.498
213 AFG Asia Afghanistan 2020-01-03 0.0 0.0 NaN 0.0 0.0 NaN ... 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83 0.498
214 AFG Asia Afghanistan 2020-01-04 0.0 0.0 NaN 0.0 0.0 NaN ... 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83 0.498
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48631 YEM Asia Yemen 2020-10-09 2054.0 1.0 2.286 594.0 0.0 0.857 ... 1479.147 18.8 495.003 5.35 7.6 29.2 49.542 0.7 66.12 0.452
48632 YEM Asia Yemen 2020-10-10 2054.0 0.0 1.571 594.0 0.0 0.857 ... 1479.147 18.8 495.003 5.35 7.6 29.2 49.542 0.7 66.12 0.452
48633 YEM Asia Yemen 2020-10-11 2055.0 1.0 1.571 595.0 1.0 0.714 ... 1479.147 18.8 495.003 5.35 7.6 29.2 49.542 0.7 66.12 0.452
48634 YEM Asia Yemen 2020-10-12 2055.0 0.0 1.429 596.0 1.0 0.714 ... 1479.147 18.8 495.003 5.35 7.6 29.2 49.542 0.7 66.12 0.452
48635 YEM Asia Yemen 2020-10-13 2056.0 1.0 1.571 596.0 0.0 0.571 ... 1479.147 18.8 495.003 5.35 7.6 29.2 49.542 0.7 66.12 0.452

11890 rows × 41 columns

Grouping the data by country and date, then keeping the latest row for each country:

In [6]:
covid = covid_data1.groupby(['location', 'date']).max().reset_index().sort_values('date', ascending=False)
covid = covid.drop_duplicates(subset = ['location'])
covid = covid[covid['total_cases']>0]
covid.head()
Out[6]:
location date iso_code continent total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index
11889 Yemen 2020-10-13 YEM Asia 2056.0 1.0 1.571 596.0 0.0 0.571 ... 1479.147 18.8 495.003 5.35 7.6 29.2 49.542 0.70 66.12 0.452
10700 Timor 2020-10-13 TLS Asia 29.0 1.0 0.143 0.0 0.0 0.000 ... 6570.102 30.3 335.346 6.86 6.3 78.1 28.178 5.90 69.50 0.625
4667 Jordan 2020-10-13 JOR Asia 26073.0 1147.0 1229.857 207.0 16.0 13.857 ... 8337.490 0.1 208.257 11.75 NaN NaN NaN 1.40 74.53 0.735
8683 Saudi Arabia 2020-10-13 SAU Asia 339615.0 348.0 407.000 5068.0 25.0 24.286 ... 49045.411 NaN 259.538 17.72 1.8 25.4 NaN 2.70 75.13 0.853
4446 Japan 2020-10-13 JPN Asia 89673.0 326.0 518.000 1634.0 5.0 4.571 ... 39002.223 NaN 79.370 5.72 11.2 33.7 NaN 13.05 84.63 0.909

5 rows × 41 columns
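
The "latest row per country" step can also be written as a sort followed by `drop_duplicates(keep='last')`. This is only a small sketch on made-up numbers: the frame `toy` and its values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Made-up rows: two dates for each of two countries
toy = pd.DataFrame({
    "location": ["Nepal", "Nepal", "Japan", "Japan"],
    "date": ["2020-10-12", "2020-10-13", "2020-10-12", "2020-10-13"],
    "total_cases": [100.0, 115.0, 200.0, 210.0],
})

# ISO-formatted date strings sort chronologically, so after sorting
# the last row of each country is its most recent one
latest = (toy.sort_values("date")
             .drop_duplicates(subset="location", keep="last")
             .reset_index(drop=True))

print(latest)
```

This avoids the `max()` aggregation, which takes the maximum of each column separately and so could mix values from different rows if a group ever held more than one row.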

In [7]:
fig = go.Figure(data=go.Choropleth(
    locations = covid['location'],
    locationmode = 'country names',
    z = covid['new_cases'],
    colorscale = 'Reds',
    marker_line_color = 'black',
    marker_line_width = 0.5
))

fig.update_layout(
    title_text = 'New cases As of October 13 : Asia',
    title_x = 0.5,
    geo=dict(
        showframe = False,
        showcoastlines = False,
        projection_type = 'equirectangular'
    )
)

REMOVING INDIA SINCE ITS CASE COUNT IS TOO LARGE FOR THE COLOR SCALE

In [8]:
df_no_india = covid[covid['location'] != 'India']
fig = go.Figure(data=go.Choropleth(
    locations = df_no_india['location'],
    locationmode = 'country names',
    z = df_no_india['new_cases'],
    colorscale = 'Reds',
    marker_line_color = 'black',
    marker_line_width = 0.5
))

fig.update_layout(
    title_text = 'New cases As of October 13 : Asia(Excluding India)',
    title_x = 0.5,
    geo=dict(
        showframe = False,
        showcoastlines = False,
        projection_type = 'equirectangular'
    )
)
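
Dropping India is one way to keep the map readable. Another option, sketched here as an assumption rather than something this notebook does, is to keep every country and color the map by the base-10 logarithm of the count. The frame `toy` and its numbers are made up.

```python
import numpy as np
import pandas as pd

# Made-up counts with one very large value
toy = pd.DataFrame({
    "location": ["India", "Nepal", "Bhutan"],
    "new_cases": [60000.0, 4000.0, 5.0],
})

# log10(1 + x) compresses large values and maps a count of zero to zero
toy["log_new_cases"] = np.log10(1 + toy["new_cases"])

print(toy)
# toy['log_new_cases'] could then be passed as `z` to go.Choropleth
# in place of the raw new_cases column
```

The trade-off is that the color bar then shows log values, so it needs a label that says so.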

Dropping unnecessary columns

The dataset above has a lot of columns (features). We aren't using all of them in this project, so we select only the ones we need: new tests, new cases, new deaths, total cases and total deaths, for Asia and Nepal.

In [9]:
df = covid_data1[['continent','location', 'date','new_tests','new_cases','new_deaths','total_cases','total_deaths']]
df
Out[9]:
continent location date new_tests new_cases new_deaths total_cases total_deaths
210 Asia Afghanistan 2019-12-31 NaN 0.0 0.0 0.0 0.0
211 Asia Afghanistan 2020-01-01 NaN 0.0 0.0 0.0 0.0
212 Asia Afghanistan 2020-01-02 NaN 0.0 0.0 0.0 0.0
213 Asia Afghanistan 2020-01-03 NaN 0.0 0.0 0.0 0.0
214 Asia Afghanistan 2020-01-04 NaN 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ...
48631 Asia Yemen 2020-10-09 NaN 1.0 0.0 2054.0 594.0
48632 Asia Yemen 2020-10-10 NaN 0.0 0.0 2054.0 594.0
48633 Asia Yemen 2020-10-11 NaN 1.0 1.0 2055.0 595.0
48634 Asia Yemen 2020-10-12 NaN 0.0 1.0 2055.0 596.0
48635 Asia Yemen 2020-10-13 NaN 1.0 0.0 2056.0 596.0

11890 rows × 8 columns

Dealing With Missing Data

Some values are missing (NaN), so we replace them with zero. We could have used the column mean instead, but this is our first project on SageMaker, so we keep it simple.

In [10]:
df = df.fillna(0)  # fillna returns a new frame, so assign it back
df
Out[10]:
continent location date new_tests new_cases new_deaths total_cases total_deaths
210 Asia Afghanistan 2019-12-31 0.0 0.0 0.0 0.0 0.0
211 Asia Afghanistan 2020-01-01 0.0 0.0 0.0 0.0 0.0
212 Asia Afghanistan 2020-01-02 0.0 0.0 0.0 0.0 0.0
213 Asia Afghanistan 2020-01-03 0.0 0.0 0.0 0.0 0.0
214 Asia Afghanistan 2020-01-04 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ...
48631 Asia Yemen 2020-10-09 0.0 1.0 0.0 2054.0 594.0
48632 Asia Yemen 2020-10-10 0.0 0.0 0.0 2054.0 594.0
48633 Asia Yemen 2020-10-11 0.0 1.0 1.0 2055.0 595.0
48634 Asia Yemen 2020-10-12 0.0 0.0 1.0 2055.0 596.0
48635 Asia Yemen 2020-10-13 0.0 1.0 0.0 2056.0 596.0

11890 rows × 8 columns
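
Filling every gap with zero is simple, but in a cumulative column such as `total_cases` a zero in the middle of a series reads as a drop to no cases at all. One alternative, sketched under the assumption that a missing cumulative value means "no update that day", is to carry the last known total forward within each country and zero-fill only the daily counts. The frame `toy` is made up.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in both a daily and a cumulative column
toy = pd.DataFrame({
    "location": ["Aruba"] * 4,
    "date": ["2020-03-19", "2020-03-20", "2020-03-21", "2020-03-22"],
    "new_cases": [2.0, np.nan, 2.0, np.nan],
    "total_cases": [2.0, np.nan, 4.0, np.nan],
})

filled = toy.sort_values(["location", "date"]).copy()

# Daily counts: a missing value is treated as "no new cases reported"
filled["new_cases"] = filled["new_cases"].fillna(0)

# Cumulative counts: carry the last known total forward within each country,
# then use 0 for any days before the first report
filled["total_cases"] = (filled.groupby("location")["total_cases"]
                               .ffill()
                               .fillna(0))

print(filled)
```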

Corona Virus Analysis In Nepal

CORONA CASES IN NEPAL TILL NOW

In [11]:
df11=covid_data.loc[covid_data['location'] == 'Nepal']
bar_data = df11.groupby(['date'])['new_cases'].sum().reset_index().sort_values('date', ascending=True)

fig = px.bar(bar_data, x="date", y="new_cases", text = 'new_cases', orientation='v', height=600,
             title='Confirmed Cases In Nepal Till October 13')
fig.show()
In [15]:
def plot_var(var='new_deaths',
             location='Nepal'):
    """
    Plots a bar chart of the given variable for the last 30 days of a location
    """
    assert isinstance(var, str), "Expected string as the variable name"
    assert isinstance(location, str), "Expected string as the location name"
    
    # Select the location once and keep its 30 most recent rows
    recent = df[df['location']==location].tail(30)
    y = recent[var]
    x = recent['date']
    plt.figure(figsize=(12,4))
    plt.title("{} for {} in the last 30 days".format(var,location),fontsize=18)
    plt.bar(x=x,height=y,edgecolor='k',color='orange')
    plt.grid(True)
    plt.xticks(fontsize=14,rotation=45)
    plt.yticks(fontsize=14)
    plt.show()

plot_var('new_cases')
plot_var('new_deaths')
In [16]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
covid_data = pd.read_csv('owid-covid-data.csv',sep=',')
df=covid_data.loc[covid_data['location'] == 'Nepal']
df = df[['continent','location', 'date','new_cases','new_deaths','total_cases','total_deaths']]
df = df.fillna(0)  # assign back; fillna does not modify df in place
covidByDay =df.groupby(['date'])[['total_cases']].sum().sort_values('date', ascending=False)
covidByDay.head()
Out[16]:
total_cases
date
2020-10-13 111802.0
2020-10-12 107755.0
2020-10-11 105684.0
2020-10-10 100676.0
2020-10-09 98617.0

TOTAL NUMBER OF CASES IN NEPAL

In [17]:
# Sort chronologically so the x axis runs from the oldest to the newest date
covidByDay = covidByDay.sort_index()
labels = covidByDay.index.values

plt.figure(figsize=(24, 6))
ax = sns.lineplot(data=covidByDay, palette="tab10", linewidth=2.5)
ax.set_xticks(range(len(labels)))  # one tick per date, matching the labels below
ax.set_xticklabels(labels, rotation=70, horizontalalignment='right')
ax.set_ylabel('Total Cases')
ax.set_title('Cases of COVID-19 In Nepal')
ax.margins(0)

n = 7  # Keeps every 7th label
for i, label in enumerate(ax.xaxis.get_ticklabels()):
    if i % n != 0:
        label.set_visible(False)

ax
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1cb4cd3710>

REMOVING SOME ROWS

Since the number of cases in the first few days is very small, the early part of the graph fluctuates a lot before it settles into a nearly straight line. Hence, we choose to ignore the low-valued data in order to develop a model with a better fit.

In [18]:
# keep only the rows where the total number of cases exceeds 10,000
df = df[df['total_cases'] > 10000]
df
Out[18]:
continent location date new_cases new_deaths total_cases total_deaths
34771 Asia Nepal 2020-06-25 1702.0 0.0 10728.0 23.0
34772 Asia Nepal 2020-06-26 434.0 3.0 11162.0 26.0
34773 Asia Nepal 2020-06-27 593.0 1.0 11755.0 27.0
34774 Asia Nepal 2020-06-28 0.0 0.0 11755.0 27.0
34775 Asia Nepal 2020-06-29 1017.0 1.0 12772.0 28.0
... ... ... ... ... ... ... ...
34877 Asia Nepal 2020-10-09 4364.0 12.0 98617.0 590.0
34878 Asia Nepal 2020-10-10 2059.0 10.0 100676.0 600.0
34879 Asia Nepal 2020-10-11 5008.0 14.0 105684.0 614.0
34880 Asia Nepal 2020-10-12 2071.0 22.0 107755.0 636.0
34881 Asia Nepal 2020-10-13 4047.0 9.0 111802.0 645.0

111 rows × 7 columns

In [19]:
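df = df.copy()  # work on a copy so the insert does not touch a view of the filtered frame
ar = list(range(1, len(df) + 1))  # serial day number, derived from the row count instead of hard-coded
df.insert(0,"SN",ar,True)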
df
Out[19]:
SN continent location date new_cases new_deaths total_cases total_deaths
34771 1 Asia Nepal 2020-06-25 1702.0 0.0 10728.0 23.0
34772 2 Asia Nepal 2020-06-26 434.0 3.0 11162.0 26.0
34773 3 Asia Nepal 2020-06-27 593.0 1.0 11755.0 27.0
34774 4 Asia Nepal 2020-06-28 0.0 0.0 11755.0 27.0
34775 5 Asia Nepal 2020-06-29 1017.0 1.0 12772.0 28.0
... ... ... ... ... ... ... ... ...
34877 107 Asia Nepal 2020-10-09 4364.0 12.0 98617.0 590.0
34878 108 Asia Nepal 2020-10-10 2059.0 10.0 100676.0 600.0
34879 109 Asia Nepal 2020-10-11 5008.0 14.0 105684.0 614.0
34880 110 Asia Nepal 2020-10-12 2071.0 22.0 107755.0 636.0
34881 111 Asia Nepal 2020-10-13 4047.0 9.0 111802.0 645.0

111 rows × 8 columns

PREPARING DATASET

In [20]:
x1 = np.array(df["SN"]).reshape(-1,1)
y = np.array(df['total_cases']).reshape(-1,1)
In [21]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
In [22]:
print('--'*15,end ='');print('polynomial model training',end ='');print('--'*10)

for i in range(1,6):
    polyfet = PolynomialFeatures(degree=i)
    xa = polyfet.fit_transform(x1)
    model = linear_model.LinearRegression()
    model.fit(xa,y)
    accuracy = model.score(xa,y)
    print('accuracy(R2) with degree_{} is -->  {}%'.format(i , round(accuracy*100,3)))
print('--'*45)
------------------------------polynomial model training--------------------
accuracy(R2) with degree_1 is -->  86.37%
accuracy(R2) with degree_2 is -->  99.155%
accuracy(R2) with degree_3 is -->  99.778%
accuracy(R2) with degree_4 is -->  99.779%
accuracy(R2) with degree_5 is -->  99.935%
------------------------------------------------------------------------------------------
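
The scores above are computed on the same rows the model was fitted to, so they can only stay the same or rise as the degree grows, and they say little about how well the model forecasts. A hedged sketch of a chronological hold-out check follows: fit on the earlier days and score on the most recent ones. The synthetic series `y_demo` and the 14-day hold-out are assumptions for illustration; with the notebook's data, `x1` and `y` would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the cumulative series (made-up quadratic growth)
x_demo = np.arange(1, 112).reshape(-1, 1)
y_demo = 10000 + 5.0 * x_demo.ravel() ** 2

holdout = 14  # number of most recent days kept out of the fit (an assumption)
x_train, x_test = x_demo[:-holdout], x_demo[-holdout:]
y_train, y_test = y_demo[:-holdout], y_demo[-holdout:]

scores = {}
for degree in range(1, 6):
    poly = PolynomialFeatures(degree=degree)
    model_demo = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    # Score only on days the model has not seen
    scores[degree] = r2_score(y_test, model_demo.predict(poly.transform(x_test)))
    print("degree {}: hold-out R2 = {:.3f}".format(degree, scores[degree]))
```

A degree that scores well on the held-out days is a safer choice for forecasting than one that only scores well on the training rows.
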
In [23]:
polyfet = PolynomialFeatures(degree=4) #you can change degree
xa = polyfet.fit_transform(x1)
model = linear_model.LinearRegression()
model.fit(xa,y)
yp = model.predict(xa)
yact = np.array(df['total_cases'])#.reshape(-1,1)
In [24]:
plt.figure(figsize=(8, 6)) 
plt.plot(yp,"--b")
plt.plot(yact,"-g")
plt.legend(['pred', 'actual'])
plt.xticks()
# plt.yticks([])
plt.title("comparing actual and pred", fontdict=None, loc='center')
plt.show()

PREDICTING FUTURE TREND OF CORONA CASE IN NEPAL

In [25]:
x_fut = np.arange(1, 31).reshape(-1,1)  # offsets 1..30: the 30 days after the last observed day
xf = x_fut+x1[-1:]
y_fut = (model.predict(polyfet.transform(xf))).astype(int)
In [26]:
plt.figure(figsize=(16, 10)) 
plt.plot(x1,yp,"--b")
plt.plot(x1,yact,"-g")
plt.plot(xf,y_fut,"--r")
plt.legend(['predicted', 'actual',"future_pred"])
plt.xticks()

plt.title("comparing actual and pred", fontdict=None, loc='center')
plt.show()

PREDICTION VALUES BY DAYS

In [27]:
#prediction after 7 days
days = 7
print("Corona Cases after {} days - ".format(days), end='')
# predict returns a (1, 1) array, so pick the single value before converting to int
print(int(model.predict(polyfet.transform(np.array(x1[-1:]+days).reshape(-1,1)))[0, 0]))
Corona Cases after 7 days - 125281
In [28]:
#prediction of corona cases after 30 days
days = 30
print("Corona Cases after {} days - ".format(days), end='')
# predict returns a (1, 1) array, so pick the single value before converting to int
print(int(model.predict(polyfet.transform(np.array(x1[-1:]+days).reshape(-1,1)))[0, 0]))
Corona Cases after 30 days - 201812
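
The two cells above repeat the same expression with a different horizon. A sketch of a small helper that wraps the call is given here, using the hypothetical name `predict_after`; it is fitted on a made-up quadratic series so that the block runs on its own. Polynomial fits can diverge quickly outside the fitted range, so longer horizons deserve more caution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures


def predict_after(model, poly, last_day, days):
    """Return the model's prediction `days` after `last_day` as an int."""
    x_new = np.array([[last_day + days]])
    # ravel() handles both 1-D and (1, 1)-shaped predictions
    return int(model.predict(poly.transform(x_new)).ravel()[0])


# Made-up series standing in for the notebook's x1 and y
x_demo = np.arange(1, 112).reshape(-1, 1)
y_demo = 10000 + 5.0 * x_demo.ravel() ** 2

poly_demo = PolynomialFeatures(degree=2)
model_demo = LinearRegression().fit(poly_demo.fit_transform(x_demo), y_demo)

for days in (7, 30):
    print("Cases after {} days: {}".format(
        days, predict_after(model_demo, poly_demo, last_day=111, days=days)))
```

With the notebook's objects, the call would look like `predict_after(model, polyfet, int(x1[-1, 0]), 7)`.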

Thank you
